GH-48251: [C++][CI] Add CSV fuzzing seed corpus generator by pitrou · Pull Request #48252 · apache/arrow

pitrou · 2025-11-25T14:44:25Z

Rationale for this change

The CSV seed corpus for fuzzing currently consists of sample data files from the Pandas project and our own testing repository. This PR adds an executable that generates custom seed files with well-defined characteristics designed to exercise the various data types that the CSV reader is able to infer automatically.

This PR also switches the RandomArrayGenerator facility to the non-"fast" PCG random generators, which give better output, especially relative to the seed. This requires some minor changes in the tests to work around issues that changing the random generator surfaced.

Are these changes tested?

By existing tests.

Are there any user-facing changes?

No.

GitHub Issue: [C++][CI] Add CSV fuzzing seed corpus generator #48251

pitrou · 2025-11-25T14:50:22Z

cpp/src/arrow/testing/random.cc

  GeneratorFactory(ValueType min, ValueType max) : min_(min), max_(max) {}

-  auto operator()(pcg32_fast* rng) const {
+  auto operator()(pcg32* rng) const {


It turns out pcg32_fast is not high quality. When used with RandomArrayGenerator::Strings, the first string character would very often be A...
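One plausible mechanism for the observation above can be sketched as follows. The sketch uses the XSH-RS 64-to-32 output permutation from the PCG reference C library and assumes two things that this thread does not confirm: that the "fast" (MCG) variant is seeded as `seed | 3`, and that its first output is computed from that un-advanced state. `FirstOutputAssumed` and `CountZeroFirstOutputs` are hypothetical helper names, not Arrow or PCG APIs. Under those assumptions, every small seed yields a first draw of 0, which would plausibly map to the lowest character of a range.

```cpp
#include <cstdint>

// XSH-RS 64->32 output permutation, as published in the PCG reference C
// library (pcg_output_xsh_rs_64_32).
constexpr uint32_t OutputXshRs(uint64_t state) {
  return static_cast<uint32_t>(((state >> 22u) ^ state) >>
                               ((state >> 61u) + 22u));
}

// ASSUMPTION (not confirmed in this thread): the MCG variant is seeded as
// `seed | 3` and derives its first output from that un-advanced state.
constexpr uint32_t FirstOutputAssumed(uint64_t seed) {
  return OutputXshRs(seed | 3u);
}

// Hypothetical helper: counts the seeds in [0, n) whose first output is 0
// under the assumption above.
constexpr int CountZeroFirstOutputs(uint64_t n) {
  int zeros = 0;
  for (uint64_t seed = 0; seed < n; ++seed) {
    if (FirstOutputAssumed(seed) == 0) ++zeros;
  }
  return zeros;
}

// Any state below 2^22 has no bits left after the 22-bit shift, so the
// permutation returns 0 regardless of the seed.
static_assert(FirstOutputAssumed(42) == 0);
static_assert(CountZeroFirstOutputs(1000) == 1000);
```

A generator that scrambles the seed through an LCG step before producing output would not show this pattern, which is consistent with the switch to `pcg32` in the diff, though the thread does not spell out the cause.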

pitrou · 2025-11-26T10:44:42Z

@github-actions crossbow submit -g cpp

github-actions · 2025-11-26T10:47:22Z

Revision: bcce6c5

Submitted crossbow builds: ursacomputing/crossbow @ actions-e5f01b72d0

Task	Status
example-cpp-minimal-build-static
example-cpp-minimal-build-static-system-dependency
example-cpp-tutorial
test-build-cpp-fuzz
test-conda-cpp
test-conda-cpp-valgrind
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1
test-debian-12-cpp-amd64
test-debian-12-cpp-i386
test-fedora-42-cpp
test-ubuntu-22.04-cpp
test-ubuntu-22.04-cpp-20
test-ubuntu-22.04-cpp-bundled
test-ubuntu-22.04-cpp-emscripten
test-ubuntu-22.04-cpp-no-threading
test-ubuntu-24.04-cpp
test-ubuntu-24.04-cpp-bundled-offline
test-ubuntu-24.04-cpp-gcc-13-bundled
test-ubuntu-24.04-cpp-gcc-14
test-ubuntu-24.04-cpp-minimal-with-formats
test-ubuntu-24.04-cpp-thread-sanitizer

pitrou · 2025-11-27T14:10:06Z

@github-actions crossbow submit -g cpp

github-actions · 2025-11-27T14:12:55Z

Revision: 7d45596

Submitted crossbow builds: ursacomputing/crossbow @ actions-09618dfadc

Task	Status
example-cpp-minimal-build-static
example-cpp-minimal-build-static-system-dependency
example-cpp-tutorial
test-build-cpp-fuzz
test-conda-cpp
test-conda-cpp-valgrind
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1
test-debian-12-cpp-amd64
test-debian-12-cpp-i386
test-fedora-42-cpp
test-ubuntu-22.04-cpp
test-ubuntu-22.04-cpp-20
test-ubuntu-22.04-cpp-bundled
test-ubuntu-22.04-cpp-emscripten
test-ubuntu-22.04-cpp-no-threading
test-ubuntu-24.04-cpp
test-ubuntu-24.04-cpp-bundled-offline
test-ubuntu-24.04-cpp-gcc-13-bundled
test-ubuntu-24.04-cpp-gcc-14
test-ubuntu-24.04-cpp-minimal-with-formats
test-ubuntu-24.04-cpp-thread-sanitizer

zanmato1984

Generally lgtm. Some minor questions.

zanmato1984 · 2025-12-01T08:14:49Z

cpp/src/arrow/csv/generate_fuzz_corpus.cc

+    ARROW_ASSIGN_OR_RAISE(auto buffer, WriteRecordBatch(batch, options));
+
+    ARROW_ASSIGN_OR_RAISE(auto sample_fn, dir_fn.Join(sample_name()));
+    std::cerr << sample_fn.ToString() << std::endl;


Why use standard error rather than standard out?

No precise reason, this is the same thing we're doing in other fuzz corpus generators.

cpp/build-support/fuzzing/generate_corpuses.sh

zanmato1984 · 2025-12-01T08:16:31Z

cpp/src/arrow/csv/fuzz.cc

+  read_options.block_size = 1000;
  auto parse_options = ParseOptions::Defaults();
  auto convert_options = ConvertOptions::Defaults();
  convert_options.auto_dict_encode = true;
+  convert_options.auto_dict_max_cardinality = 50;


Why do we need these changes?

The block_size change is there to increase the likelihood of chunking and the number of chunks, so as to exercise chunked reading and parallelization more. The auto_dict_max_cardinality change just explicitly sets the option to its default value, so it's really a no-op, but it signals a knob that we might want to turn.

For the record, most files generated by this PR are 5-10 kB in size.
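As a rough illustration of the block_size reasoning above, the arithmetic can be sketched like this. `EstimateBlockCount` is a hypothetical helper, not part of Arrow, and the estimate is only approximate since the actual reader aligns blocks to row boundaries. The 1 MiB figure used for comparison is assumed to be the reader's default block size.

```cpp
#include <cstdint>

// Hypothetical helper: approximate number of blocks the reader would see
// for a file of `file_size` bytes (ceiling division). The real reader
// splits on row boundaries, so actual counts may differ slightly.
constexpr int64_t EstimateBlockCount(int64_t file_size, int64_t block_size) {
  return (file_size + block_size - 1) / block_size;
}

// With block_size = 1000, the 5-10 kB seed files span several blocks.
static_assert(EstimateBlockCount(5000, 1000) == 5);
static_assert(EstimateBlockCount(10000, 1000) == 10);

// ASSUMPTION: with a default block size of about 1 MiB, the same files
// would fit in a single block and chunked reading would not be exercised.
static_assert(EstimateBlockCount(10000, int64_t{1} << 20) == 1);
```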

zanmato1984

LGTM

pitrou · 2025-12-01T08:40:50Z

@github-actions crossbow submit fuzz

github-actions · 2025-12-01T08:43:01Z

Revision: 2f092c2

Submitted crossbow builds: ursacomputing/crossbow @ actions-805c4b6939

Task	Status
test-build-cpp-fuzz

conbench-apache-arrow · 2025-12-16T22:38:34Z

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit a32730c.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 105 possible false positives for unstable benchmarks that are known to sometimes produce them.

github-actions bot added the Component: C++ and awaiting review labels Nov 25, 2025

pitrou commented Nov 25, 2025

github-actions bot added the awaiting committer review label and removed the awaiting review label Nov 25, 2025

pitrou force-pushed the gh48251-csv-seed-corpys branch 5 times, most recently from 3451a5c to bcce6c5 Compare November 26, 2025 10:12

pitrou added the CI: Extra: C++ label Nov 26, 2025

pitrou marked this pull request as ready for review November 26, 2025 11:56

pitrou requested a review from zanmato1984 November 26, 2025 12:00

pitrou force-pushed the gh48251-csv-seed-corpys branch from bcce6c5 to 7d45596 Compare November 27, 2025 14:09

zanmato1984 reviewed Dec 1, 2025

pitrou added 7 commits December 1, 2025 09:27

GH-48251: [C++][CI] Add CSV fuzzing seed corpus generator

df82105

Try to fix test on Windows

29dcad9

Add debug for Windows failures

ac4401d

Undo debug additions

204aa75

Try using AssertWithinUlp

149e3a4

Relax check

11903a2

Push fix for imprecisions on macOS

6700e75

zanmato1984 approved these changes Dec 1, 2025

Add comments

2f092c2

pitrou force-pushed the gh48251-csv-seed-corpys branch from 7d45596 to 2f092c2 Compare December 1, 2025 08:34

pitrou merged commit a32730c into apache:main Dec 1, 2025
46 of 47 checks passed

pitrou removed the awaiting committer review label Dec 1, 2025

pitrou mentioned this pull request Dec 1, 2025

[C++][CI] Add CSV fuzzing seed corpus generator #48251

Closed

pitrou deleted the gh48251-csv-seed-corpys branch December 1, 2025 09:14
